Clustering and classification

library(MASS)
library(GGally)  
library(ggplot2)
library(tidyverse)
library(corrplot)

data(Boston)
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

A glimpse at the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The variables are on very different scales.

# plot distributions and correlations of continuous variables 
p <- Boston %>% 
  ggpairs(
    mapping = aes(alpha=0.5), 
    lower = list(continuous = wrap("points", colour = "cornflowerblue", size=0.3)), 
    upper = list(continuous = wrap("cor", size = 2)),
    diag = list(continuous = wrap("densityDiag", colour = "cornflowerblue"))) +
 theme(axis.text.x = element_text(size=5), 
       axis.text.y = element_text(size=5))  
p

Only a few variables are even somewhat normally distributed (e.g. rm, the average number of rooms per dwelling). Most of the variables are skewed and/or bimodal. Next, the correlations.

cor_matrix <- round(cor(Boston), 2)
corrplot(cor_matrix, tl.cex = 0.7)

There are some notable negative correlations involving dis (weighted distance to Boston employment centres); for example, dis correlates negatively with age. tax has strong positive correlations with several variables, e.g. rad (index of accessibility to radial highways). Beyond that it is difficult to say much at this point, since there are no other clear patterns in the data.
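As a reminder of what the correlation matrix encodes, here is a tiny base-R sketch on toy data (not from Boston): two perfectly inversely related columns correlate at exactly -1.

```r
# cor() returns a symmetric matrix of pairwise correlations;
# an exactly decreasing column pairs with an increasing one at -1
d <- data.frame(a = 1:5, b = 5:1)
cor(d)["a", "b"]
## [1] -1
```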

Scale data

Since the scales of the variables are so different (which is bad for distance-based methods like clustering), we need to scale the data:

boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # scale -function outputs a matrix and we have to transform it into a data frame
summary(boston_sc)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
sd(boston_sc$dis)
## [1] 1

Scaling centred every variable at mean 0 with unit variance, which makes the variables much more comparable: for example, tax ranged from 187.0 to 711.0 before scaling and from -1.3127 to 1.7964 after. This step makes sense whenever we are interested in the distances between the cases described in this dataset.
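Under the hood, scale() standardizes each column to (x - mean(x)) / sd(x); a minimal sketch with toy values on a tax-like scale:

```r
# scale() centres each column and divides by its standard deviation
x <- c(187, 296, 711)                  # toy values, chosen for illustration
z <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z)
## [1] TRUE
```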

Then we will create a categorical variable called crime which is based on the quantiles of crim variable:

brks <- quantile(boston_sc$crim)
lbls <- c("low","med_low","med_high","high")
crime <- cut(boston_sc$crim, breaks = brks, labels = lbls, include.lowest = TRUE)
boston_sc$crime <- crime
boston_sc <- dplyr::select(boston_sc, -crim) # Remove the old crim variable
summary(boston_sc$crime)
##      low  med_low med_high     high 
##      127      126      126      127
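The quantile-based cut can be illustrated on toy data: with quantile() breaks and include.lowest = TRUE, the classes come out roughly equal in size.

```r
# cut() with quantile breaks splits a vector into ~equal-sized classes
v   <- 1:100
cls <- cut(v, breaks = quantile(v), include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(cls)   # here each of the four classes gets exactly 25 cases
```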

Then let’s divide the data into training (80%) and testing sets (20%):

n <- nrow(boston_sc)
ind <- sample(n, size = n * 0.8)  # note: without set.seed() this split is not reproducible
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]

Now we have two sets: train_set contains a randomly chosen 80% of the Boston observations, and test_set contains the remaining 20%.

Linear Discriminant Model

Next I estimate a linear discriminant model with crime as the target variable. The purpose of the analysis is to identify those variables that explain whether a tract has a high or low crime rate. We have here a four-class grouping of crime rate.

A linear discriminant model assumes that the explanatory variables are continuous and normally distributed within each class of the target variable, and that the classes share a common covariance matrix. According to the preliminary analysis, the normality assumption is not satisfied; I do not check normality within each crime class but simply proceed under the assumption. Note that scaling equalizes the overall variances of the variables, but it does not guarantee equal within-class covariance matrices, so that assumption also remains unverified.
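One quick, informal way to eyeball the equal-covariance assumption would be to compare the within-class spread of a single predictor (illustrative only, not a formal test; the snippet rebuilds the crime classes from the raw data so it is self-contained):

```r
library(MASS)
data(Boston)
# rebuild the quantile-based crime classes from the raw crim variable
crime_cls <- cut(Boston$crim, breaks = quantile(Boston$crim),
                 include.lowest = TRUE,
                 labels = c("low", "med_low", "med_high", "high"))
# within-class standard deviation of nox; large differences would
# speak against the common-covariance assumption
aggregate(Boston$nox, by = list(crime = crime_cls), FUN = sd)
```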

The results from linear discriminant analysis are shown below. The prior probabilities show the proportions of observations belonging to the four groups in the train data. They are not exactly equal because the grouping was done with all the 506 tracts. The variable means differ across crime groups suggesting that they have an association with the crime rate.

The first linear discriminant explains about 95% of the between-group variance (proportion of trace 0.9496).

lda_fit  <- lda(crime ~ ., data = train_set)
lda_fit
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2698020 0.2425743 0.2178218 0.2698020 
## 
## Group means:
##                  zn      indus         chas        nox          rm        age
## low       0.8873851 -0.9055876 -0.163968511 -0.8563255  0.41868172 -0.8558901
## med_low  -0.1733183 -0.2330967  0.008892378 -0.5657986 -0.15038773 -0.3508825
## med_high -0.4170776  0.2713442  0.309288013  0.4379456  0.07981804  0.4569088
## high     -0.4872402  1.0169738 -0.055607954  1.0231636 -0.47859267  0.8072208
##                 dis        rad        tax      ptratio       black       lstat
## low       0.8526534 -0.6763157 -0.7231059 -0.397718176  0.37637602 -0.74695791
## med_low   0.3289085 -0.5564696 -0.4712364 -0.006797986  0.32107277 -0.13552006
## med_high -0.3660002 -0.3893666 -0.2626795 -0.215662347  0.06925054  0.04360256
## high     -0.8749854  1.6395837  1.5150965  0.782471277 -0.82715139  0.94726249
##                  medv
## low       0.470654841
## med_low  -0.001237102
## med_high  0.111587797
## high     -0.694051625
## 
## Coefficients of linear discriminants:
##                 LD1           LD2         LD3
## zn       0.03198102  0.7549582724 -0.87264330
## indus    0.05562842 -0.4135340867  0.38521445
## chas    -0.06437677 -0.1265580100  0.07399599
## nox      0.34801587 -0.6134364549 -1.44896506
## rm      -0.08329676 -0.1342096938 -0.22386991
## age      0.22099856 -0.3627396371 -0.26208070
## dis     -0.03361647 -0.3966916356  0.02597317
## rad      3.15590347  0.9503847172 -0.03347636
## tax      0.04789528 -0.0362850360  0.49294637
## ptratio  0.08624829 -0.0006524072 -0.20503229
## black   -0.11623554  0.0093251572  0.07659004
## lstat    0.27658104 -0.2687005785  0.35890290
## medv     0.21390952 -0.4036817063 -0.08914383
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9496 0.0372 0.0132
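As a sanity check, the "Proportion of trace" line can be recovered from the singular values that a fitted lda object stores in its svd component; a sketch on the built-in iris data so the snippet is self-contained:

```r
library(MASS)
# lda() stores singular values in $svd; the proportion of trace
# is svd^2 / sum(svd^2)
fit <- lda(Species ~ ., data = iris)
round(fit$svd^2 / sum(fit$svd^2), 4)
## [1] 0.9912 0.0088
```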

The LDA biplot based on the estimated model is shown below. The observations are colored on the basis of their crime group. The arrows indicate that variable rad (index of accessibility to radial highways) is a strong predictor of linear discriminant 1, while variables nox (air pollution) and zn (prop. of residential land zoned for large lots) explain linear discriminant 2.

# the function for lda biplot arrows
lda_arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train_set$crime)
colnames(train_set)
##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "crime"
# plot the lda results
plot(lda_fit, dimen = 2, col=classes, pch=classes)

lda_arrows(lda_fit, myscale = 2)

Model validation

I use the test data to validate the model, i.e., to check whether the observations in the test data are classified correctly.

# save the correct classes from test data
correct_classes <- test_set$crime

# remove the crime variable from test data
test <- dplyr::select(test_set, -crime)

The confusion matrix below shows that the model classified 78 of the 102 test observations correctly, so roughly a quarter of the observations lie outside the diagonal. The med_low class was the hardest: only 15 of its 29 cases were classified correctly. Overall, the model placed about 76% of the test cases in the right crime-rate class, which can be considered a relatively good success rate.

# predict classes with test data
lda_pred <- predict(lda_fit, newdata = test)

# confusion matrix (margins added for display only)
tab1 <- table(correct = correct_classes, predicted = lda_pred$class)
tab1 %>% addmargins()

# accuracy must be computed from the margin-free table: with addmargins()
# the Sum row/column would inflate both the diagonal and the total
accuracy1 <- sum(diag(tab1)) / sum(tab1)

accuracy1
## [1] 0.7647059
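The accuracy calculation itself is just the diagonal of a margin-free confusion matrix divided by the total number of cases; a toy sketch:

```r
# toy 2x2 confusion matrix: rows = true class, columns = predicted class
cm <- matrix(c(8, 2,
               1, 9), nrow = 2, byrow = TRUE)
sum(diag(cm)) / sum(cm)   # (8 + 9) correct out of 20 cases
## [1] 0.85
```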

K-means clustering

As a final step, I run a K-means clustering analysis on the Boston data set. K-means divides the observations into a pre-defined number of clusters (K) by minimizing the distance of each observation to its cluster mean (centroid). I first look at the distances between observations, using the most common distance measure, Euclidean distance.
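The assignment step of K-means can be sketched on toy 1-D data (centroids chosen by hand for illustration): each point joins the cluster whose centroid is nearest.

```r
# one K-means assignment step on toy data
pts  <- c(0, 1, 10)    # three 1-D observations
cent <- c(0.5, 10)     # two hypothetical centroids
sapply(pts, function(p) which.min(abs(p - cent)))
## [1] 1 1 2
```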

library(MASS)
data('Boston')

# remove variable chas 
#Boston <- Boston %>% dplyr::select(-chas)

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
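By default dist() uses the Euclidean metric; a two-point sketch (the classic 3-4-5 right triangle):

```r
# Euclidean distance between (0,0) and (3,4) is sqrt(3^2 + 4^2) = 5
m <- rbind(c(0, 0), c(3, 4))
dist(m)
##   1
## 2 5
```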

I first run K-means with two clusters (K = 2). The plots below suggest that the variables black and tax have a strong association with the clusters; many of the other variables seem to have a somewhat weaker effect, too.

# k-means clustering (on the scaled data, so that no variable dominates the distances)
km <- kmeans(boston_scaled, centers = 2)

summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       28    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
# plot the Boston dataset with clusters
pairs(boston_scaled[1:10], col = km$cluster)

pairs(boston_scaled[4:6], col = km$cluster)

pairs(boston_scaled[7:9], col = km$cluster)

pairs(boston_scaled[10:12], col = km$cluster)

pairs(boston_scaled[13:14], col = km$cluster)

I search for the optimal number of clusters K by inspecting how the total within-cluster sum of squares (total WCSS) changes as K changes. I let K run from 1 to 10. A reasonable choice of K is the value after which the total WCSS stops dropping rapidly (the "elbow").

The plot below shows that the optimal number of clusters is 2.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

I choose K = 2 and re-run the K-means clustering algorithm. The plots below support dividing this data set into two clusters. (TWCSS = total within-cluster sum of squares.)

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with 2 clusters
pairs(boston_scaled[1:6], col = km$cluster)

pairs(boston_scaled[7:14], col = km$cluster)

Bonus

data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda, dimen=2, col=boston_again$cluster)
lda_arrows(boston_lda, myscale=2)

It seems that the most influential individual variable is black. Other variables with a relatively strong influence are crim, indus, nox and tax. The variables black and crim "pull" in their own directions, while most of the remaining variables form a group that pulls to the left. The influence of black confirms earlier observations (e.g. part 7 of this exercise).

Super-bonus

model_predictors <- dplyr::select(train_set, -crime)
# summarise the predictors
summary(model_predictors)
##        zn               indus               chas               nox           
##  Min.   :-0.48724   Min.   :-1.55630   Min.   :-0.27233   Min.   :-1.464433  
##  1st Qu.:-0.48724   1st Qu.:-0.86683   1st Qu.:-0.27233   1st Qu.:-0.920756  
##  Median :-0.48724   Median :-0.18028   Median :-0.27233   Median :-0.169964  
##  Mean   :-0.02493   Mean   : 0.03261   Mean   : 0.01029   Mean   : 0.003159  
##  3rd Qu.:-0.48724   3rd Qu.: 1.01499   3rd Qu.:-0.27233   3rd Qu.: 0.613189  
##  Max.   : 3.80047   Max.   : 2.42017   Max.   : 3.66477   Max.   : 2.729645  
##        rm                age                 dis                 rad          
##  Min.   :-3.87641   Min.   :-2.333128   Min.   :-1.265817   Min.   :-0.98187  
##  1st Qu.:-0.58052   1st Qu.:-0.813528   1st Qu.:-0.843952   1st Qu.:-0.63733  
##  Median :-0.15604   Median : 0.297529   Median :-0.261737   Median :-0.52248  
##  Mean   :-0.03526   Mean   : 0.001279   Mean   :-0.005963   Mean   : 0.04009  
##  3rd Qu.: 0.47482   3rd Qu.: 0.900573   3rd Qu.: 0.674147   3rd Qu.: 1.65960  
##  Max.   : 3.55153   Max.   : 1.116390   Max.   : 3.284050   Max.   : 1.65960  
##       tax              ptratio             black              lstat         
##  Min.   :-1.30676   Min.   :-2.70470   Min.   :-3.90333   Min.   :-1.50301  
##  1st Qu.:-0.76237   1st Qu.:-0.48756   1st Qu.: 0.19756   1st Qu.:-0.76397  
##  Median :-0.39895   Median : 0.29768   Median : 0.38371   Median :-0.18527  
##  Mean   : 0.04215   Mean   : 0.05518   Mean   :-0.02865   Mean   : 0.03067  
##  3rd Qu.: 1.52941   3rd Qu.: 0.80578   3rd Qu.: 0.44062   3rd Qu.: 0.62343  
##  Max.   : 1.79642   Max.   : 1.63721   Max.   : 0.44062   Max.   : 3.54526  
##       medv         
##  Min.   :-1.90634  
##  1st Qu.:-0.63420  
##  Median :-0.17210  
##  Mean   :-0.03627  
##  3rd Qu.: 0.24651  
##  Max.   : 2.98650
summary(lda_fit$scaling)
##       LD1                LD2                  LD3          
##  Min.   :-0.11624   Min.   :-0.6134365   Min.   :-1.44897  
##  1st Qu.:-0.03362   1st Qu.:-0.3966916   1st Qu.:-0.22387  
##  Median : 0.05563   Median :-0.1342097   Median :-0.03348  
##  Mean   : 0.31843   Mean   :-0.0801401   Mean   :-0.13243  
##  3rd Qu.: 0.22100   3rd Qu.:-0.0006524   3rd Qu.: 0.07659  
##  Max.   : 3.15590   Max.   : 0.9503847   Max.   : 0.49295
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)

summary(matrix_product)
##       LD1               LD2                LD3          
##  Min.   :-4.4002   Min.   :-3.37037   Min.   :-3.37316  
##  1st Qu.:-2.6131   1st Qu.:-0.62138   1st Qu.:-0.48630  
##  Median :-1.6335   Median : 0.06645   Median : 0.12156  
##  Mean   : 0.1422   Mean   : 0.01376   Mean   : 0.05807  
##  3rd Qu.: 5.6791   3rd Qu.: 0.58766   3rd Qu.: 0.62630  
##  Max.   : 7.2669   Max.   : 3.34902   Max.   : 2.47373
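The matrix multiplication above simply projects each (scaled) observation onto the discriminant directions; a minimal sketch with a hypothetical direction w:

```r
# each row of X is an observation; %*% projects the rows onto w
X <- rbind(c(1, 0),
           c(0, 2))
w <- c(0.6, 0.8)   # a hypothetical unit-length direction
X %*% w
##      [,1]
## [1,]  0.6
## [2,]  1.6
```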
# install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x=matrix_product$LD1,  y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=classes, size=I(40))
train_km <- kmeans(train_set[,-14],centers=4)
cluster_col <- train_km$cluster

plot_ly(x=matrix_product$LD1,  y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col, size=I(40))

The scatter of points in these two plots resembles each other fairly closely. Based on them, the crime variable seems to play a relatively large role when the algorithm forms the clusters, since the cluster colours are distributed much like the crime-rate classes.